After completing the exercises on tidy data and sitting through lengthy lectures on data formats and the specifics of importing them, it’s now your turn to practice importing data in the tidyverse.
We prepared some datasets, for example, the Titanic dataset from Kaggle, which you can use to play with some of the functions from readr and related packages. You can find them in the ../data folder. Importing data, however, often takes just a single command. For this reason, we prepared a few individual tasks for you to work on in these exercises.
That said, let’s start with a simple import.
Import the Titanic data using the readr library and the function read_...
library(readr)
titanic <-
  read_csv("../data/titanic/titanic.csv")
## Parsed with column specification:
## cols(
##   PassengerId = col_double(),
##   Survived = col_double(),
##   Pclass = col_double(),
##   Name = col_character(),
##   Sex = col_character(),
##   Age = col_double(),
##   SibSp = col_double(),
##   Parch = col_double(),
##   Ticket = col_character(),
##   Fare = col_double(),
##   Cabin = col_character(),
##   Embarked = col_character()
## )
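The column specification that readr guessed can also be retrieved after the import with spec(). The following is a minimal sketch under stated assumptions: it uses a small made-up CSV written to a temporary file (the rows and the name sample_data are hypothetical, not the real Titanic file), and col_types = cols() is passed only to let readr guess the types without printing the message.

```r
library(readr)

# Hypothetical three-row sample, written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(
  c(
    "PassengerId,Name,Sex,Age",
    "1,Braund,male,22",
    "2,Cumings,female,38",
    "3,Heikkinen,female,26"
  ),
  tmp
)

# cols() with no arguments means: guess every column type
sample_data <- read_csv(tmp, col_types = cols())

# Retrieve the column specification readr settled on
sample_spec <- spec(sample_data)
sample_spec

# Text columns such as Sex arrive as character vectors
class(sample_data$Sex)
```

The printed specification can be copied into col_types = and edited, which is a convenient starting point when only one or two columns need a different type.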
You may have noticed that the function you just used imports categorical variables as characters by default. For some analyses, this is not what we want. So let’s pretend we’re particularly interested in gender differences in a regression model or the like.
Re-import the Titanic data and convert Sex to a factor.
titanic <-
  read_csv(
    "../data/titanic/titanic.csv",
    col_types = cols(
      Sex = col_factor()
    )
  )
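When col_factor() is called without levels, readr takes the levels from the data in the order in which they appear. For a regression model the reference category usually matters, so it can help to set the levels explicitly. A minimal sketch, assuming a small made-up CSV (the rows and the name sample_data are hypothetical, not the real Titanic file):

```r
library(readr)

# Hypothetical three-row sample, written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(
  c(
    "PassengerId,Sex,Age",
    "1,male,22",
    "2,female,38",
    "3,female,26"
  ),
  tmp
)

# Fix the level order so that "female" is the first (reference) level;
# columns not mentioned in cols() are still guessed
sample_data <- read_csv(
  tmp,
  col_types = cols(
    Sex = col_factor(levels = c("female", "male"))
  )
)

levels(sample_data$Sex)
```

Without the levels argument, the first level in this sample would be "male", simply because it occurs first in the file.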
After working on the Titanic data, we got bored. Now we want to work on some longitudinal, cross-country data. The gapminder GDP data comes to mind!
library(readxl)
gapminder_GDP <-
  read_excel("../data/gapminder/GDPpercapitaconstant2000US.xlsx")
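Excel workbooks can hold several sheets, and read_excel() reads the first one unless told otherwise. A minimal sketch of how one might check the sheets first, assuming the example workbook datasets.xlsx that ships with readxl (it stands in for the gapminder file, whose sheet layout is not shown here):

```r
library(readxl)

# Example workbook bundled with readxl (not the gapminder file)
path <- readxl_example("datasets.xlsx")

# List the sheet names before deciding what to import
sheets <- excel_sheets(path)
sheets

# Import one sheet by name; sheet = also accepts a position such as 3
chickwts <- read_excel(path, sheet = "chickwts")
chickwts
```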
Although you applied two different import functions, the outcome is the same: in both cases you get a tibble. However, the file format of the latter dataset is more complex. In the last exercise, we build on that and apply some more options using the unicorns on unicycles data. What we know about these data is this little story:
The documents were recently unearthed from a hidden chest in Delft and seem to be written by Rudolphus Hogervorstus, my great great great uncle, in 1681. These documents show that he was a scientist studying the then roaming herds of unicorns in the area around Delft. Unfortunately these animals are extinct now. His work contains multiple tables, carefully written down, documenting the population of unicorns over time in multiple places and related to that the sales and numbers of unicycles in those countries. According to Rudolphus the unicorn populations and unicycles are related “The presence of the cone on the unicorn hints at a very defined sense of equilibrium, it is therefore only natural to assume unicorns ride unicycles”. As part of the archival process these tables were copied, as Rudolphus himself would say: “with the black magic, so vile it could not be discussed for hell would come descent upon us” into satans own spawn: Microsoft Excel. This ‘raw’ data gives us a nice example of typical dirty data you would find in the wild. The goal is to combine the sales data of unicycles and the populations of unicorns into a single ‘tidy’ dataframe.
Source: https://github.com/RMHogervorst/unicorns_on_unicycles
Import the unicorn sales data but leave out the total_turnover variable: only read in the cell range A1:C43.
Hint: use the argument range = range_definition.
library(readxl)
unicorn_sales <-
  read_excel(
    "../data/unicorns/sales.xlsx",
    range = "A1:C43"
  )
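The range argument accepts more than a fixed rectangle such as "A1:C43". A minimal sketch of two variants, again assuming the example workbook datasets.xlsx bundled with readxl rather than the unicorn file (the object names by_rectangle and by_columns are made up for illustration):

```r
library(readxl)

# Example workbook bundled with readxl (not the unicorn sales file)
path <- readxl_example("datasets.xlsx")

# Explicit rectangle: the header row plus five data rows, columns A to C
by_rectangle <- read_excel(path, sheet = "mtcars", range = "A1:C6")

# Column-only range: all rows, but only columns A to C
by_columns <- read_excel(path, sheet = "mtcars", range = cell_cols("A:C"))

dim(by_rectangle)
names(by_columns)
```

The column-only form is handy when the number of rows is not known in advance or may grow, since only the columns to keep have to be named.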